Sequence Molecules (DNA, RNA & Protein Sequences)

Sequence molecules are DNA, RNA and protein molecules whose structures are determined by an underlying molecular sequence. They are derived from the DNA, RNA and Protein classes in the bioseq module. Note that each instance of these classes represents a single strand. For multi-stranded objects such as double-stranded DNA or DNA-RNA complexes, each strand has to be instantiated separately.

Internally, the type hierarchies for DNA, RNA and Protein are

  • Molecule -> SequenceMolecule -> Polynucleotide -> DNA,RNA
  • Molecule -> SequenceMolecule -> Polypeptide -> Protein
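
The two inheritance chains can be pictured with a small stand-in sketch. The empty classes below only mirror the names listed above; they are hypothetical placeholders and not the wc_rules implementation, and the helper lineage() is invented for this illustration.

```python
# Stand-in classes that mirror the names in the hierarchy above.
# They are empty placeholders, not the real wc_rules classes.
class Molecule: pass
class SequenceMolecule(Molecule): pass
class Polynucleotide(SequenceMolecule): pass
class Polypeptide(SequenceMolecule): pass
class DNA(Polynucleotide): pass
class RNA(Polynucleotide): pass
class Protein(Polypeptide): pass

def lineage(cls):
    """Hypothetical helper: render the chain of base classes, root first."""
    # __mro__ runs from the class up to object; drop object and reverse.
    return " -> ".join(c.__name__ for c in reversed(cls.__mro__[:-1]))

print(lineage(DNA))
print(lineage(Protein))
# DNA and RNA share Polynucleotide; Protein does not.
print(issubclass(RNA, Polynucleotide), issubclass(Protein, Polynucleotide))
```

Under this layout, methods defined at the SequenceMolecule level are available to all three molecule types, while methods defined on Polynucleotide are available only to DNA and RNA.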

Methods common to Sequence Molecules (DNA, RNA and Protein)

All SequenceMolecule objects have a sequence attribute, which holds a reference to a Bio.Seq.Seq object from Biopython. During instantiation, set use_permissive_alphabet to choose between a permissive alphabet (the default) and a strict one, e.g., GATCRYWSMKHBVDN vs. GATC for DNA.
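
The difference between the two DNA alphabets can be illustrated without wc_rules or Biopython. The sketch below is a minimal membership check; the function is_valid_dna() is a hypothetical helper written for this example, and it says nothing about how wc_rules itself reacts to letters outside the chosen alphabet.

```python
# Letter sets taken from the text above: strict vs. permissive DNA alphabets.
STRICT_DNA = set("GATC")
PERMISSIVE_DNA = set("GATCRYWSMKHBVDN")

def is_valid_dna(sequence, use_permissive_alphabet=True):
    """Hypothetical helper: True if every letter belongs to the chosen alphabet."""
    alphabet = PERMISSIVE_DNA if use_permissive_alphabet else STRICT_DNA
    return set(sequence.upper()) <= alphabet

# 'N' (any base) is an ambiguity code: allowed only in the permissive alphabet.
print(is_valid_dna("GATTACA", use_permissive_alphabet=False))  # True
print(is_valid_dna("GATNACA", use_permissive_alphabet=False))  # False
print(is_valid_dna("GATNACA", use_permissive_alphabet=True))   # True
```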

Instance Attribute  Setter                  Getter                  Unsetter              Modifier
id                  set_id()                get_id()                -                     -
sites               add_sites(*sites)       get_sites(**kwargs)     remove_sites(*sites)  -
sequence            set_sequence(inputstr)  get_sequence(**kwargs)  -                     replace_sequence(**kwargs)
                    -                       get_sequence_length()   -                     delete_sequence(**kwargs)
                    -                       -                       -                     insert_sequence(**kwargs)

The method get_sequence() has the input signature:

get_sequence(start=None, end=None, length=None, as_string=False)

Sequences are indexed like Python strings (zero-based, with the end position excluded), and a subsequence can be located by either a (start,end) or a (start,length) coordinate pair. If both end and length are provided, length is ignored. as_string indicates whether the output is a plain Python string or a Bio.Seq.Seq object (the default).
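
The coordinate rules can be sketched in plain Python. The helpers resolve_span() and get_subsequence() are hypothetical and only mimic the behavior described above on an ordinary string. Treating a missing start as 0 and a missing end as the sequence length is an assumption, chosen because get_sequence() with no arguments returns the whole sequence.

```python
def resolve_span(seq_length, start=None, end=None, length=None):
    """Hypothetical helper: turn (start,end) or (start,length) into (start,end)."""
    if start is None:
        start = 0  # assumption: an omitted start means the beginning
    if end is None:
        # length is consulted only when end is absent, so end wins if both are given
        end = start + length if length is not None else seq_length
    return start, end

def get_subsequence(sequence, start=None, end=None, length=None):
    """Slice a plain string using the resolved half-open span."""
    begin, stop = resolve_span(len(sequence), start, end, length)
    return sequence[begin:stop]

dna = ('TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCG'
       'ACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA')

print(get_subsequence(dna, start=90, end=99))            # CAATACAGA
print(get_subsequence(dna, start=90, length=9))          # CAATACAGA
print(get_subsequence(dna, start=90, end=99, length=3))  # CAATACAGA (length ignored)
```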


In [1]:
# create a DNA molecule with a particular sequence
from wc_rules.bioseq import DNA, RNA, Protein
inputstr = 'TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA'
dna1 = DNA(use_permissive_alphabet=False).set_sequence(inputstr)
dna1.get_sequence()


Out[1]:
Seq('TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTG...AGA', IUPACUnambiguousDNA())

In [2]:
# Get entire sequence
dna1.get_sequence()


Out[2]:
Seq('TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTG...AGA', IUPACUnambiguousDNA())

In [3]:
# Get a subsequence using (start,end)
dna1.get_sequence(start=90,end=99)


Out[3]:
Seq('CAATACAGA', IUPACUnambiguousDNA())

In [4]:
# Get a subsequence using (start,length)
dna1.get_sequence(start=90,length=9)


Out[4]:
Seq('CAATACAGA', IUPACUnambiguousDNA())

In [5]:
# Get a subsequence by unpacking a dict
loc = dict(start=90,end=99)
dna1.get_sequence(**loc)


Out[5]:
Seq('CAATACAGA', IUPACUnambiguousDNA())

In [6]:
# Output as string
dna1.get_sequence(start=90,end=99,as_string=True)


Out[6]:
'CAATACAGA'

In [7]:
# Get sequence length
dna1.get_sequence_length()


Out[7]:
99

In [8]:
# Get subsequence length, only (start,end) allowed
dna1.get_sequence_length(start=90,end=99)


Out[8]:
9

Methods common to Polynucleotide molecules (DNA and RNA)

Polynucleotide objects (i.e., DNA and RNA) have the following additional methods, which read the molecular sequence, perform alphabet conversion, and return a sequence object (Bio.Seq.Seq):

  • get_dna(**kwargs), returns a DNA sequence
  • get_rna(**kwargs), returns an RNA sequence
  • get_protein(**kwargs), returns a protein sequence

The following kwargs are common to all three methods: start=None, end=None, length=None, as_string=False, and option, which takes one of 'coding', 'complementary' or 'reverse_complementary'.

The start, end, length and as_string kwargs behave exactly as they do for get_sequence().

The option kwarg indicates how the sequence is processed:

  • option='coding' (default) calls get_sequence(), then performs alphabet conversion,
  • option='complementary' calls get_sequence(), converts to the complement, then performs alphabet conversion,
  • option='reverse_complementary' calls get_sequence(), converts to the reverse complement, then performs alphabet conversion.
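
The three options can be illustrated on plain strings. This is a minimal sketch under stated assumptions: process_dna() and dna_to_rna() are hypothetical helpers, only the strict GATC alphabet is handled, and the DNA-to-RNA alphabet conversion is taken to be the usual T to U substitution.

```python
# Base-pairing map for the strict DNA alphabet only (no ambiguity codes).
COMPLEMENT = str.maketrans("GATC", "CTAG")

def process_dna(sequence, option="coding"):
    """Hypothetical helper: apply one of the three option values to a DNA string."""
    if option == "coding":
        return sequence
    if option == "complementary":
        return sequence.translate(COMPLEMENT)
    if option == "reverse_complementary":
        return sequence.translate(COMPLEMENT)[::-1]
    raise ValueError("unknown option: " + repr(option))

def dna_to_rna(sequence):
    """Alphabet conversion from DNA to RNA letters (T becomes U)."""
    return sequence.replace("T", "U")

tail = "CAATACAGA"  # last nine bases of the example DNA sequence
for option in ("coding", "complementary", "reverse_complementary"):
    print(option, dna_to_rna(process_dna(tail, option)))
```

Because the reverse complement reads the strand backwards, the last bases of the DNA become the first bases of the result, which is why the RNA in Out[10] begins with UCUGUAUUG.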

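The get_protein() method accepts the additional kwargs table=1 and to_stop=False, which have the same meaning as in Biopython's Seq.translate() method: table selects the genetic code (1 is the standard code), and to_stop=True ends translation at the first in-frame stop codon.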
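
To make the effect of to_stop concrete, here is a dependency-free sketch of translation with the standard genetic code (table 1). The translate() function below is a hypothetical stand-in written for this illustration, not Biopython's implementation; it handles only the strict GATC alphabet and silently drops a trailing partial codon.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order
# with the third base varying fastest; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=False):
    """Hypothetical stand-in: translate a DNA string codon by codon."""
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*" and to_stop:
            break  # stop before emitting the stop symbol
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTAAGGG"))                # M*G
print(translate("ATGTAAGGG", to_stop=True))  # M
```

With to_stop=False the stop codon appears as '*' and translation continues; with to_stop=True the result ends just before the first stop codon.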


In [9]:
inputstr = 'TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTGAAAAACCTTGCAGTTGATTTTAAAGCGTATAGAAGACAATACAGA'
dna1 = DNA(use_permissive_alphabet=False).set_sequence(inputstr)
dna1.get_sequence()


Out[9]:
Seq('TTGTTATCGTTACCGGGAGTGAGGCGTCCGCGTCCCTTTCAGGTCAAGCGACTG...AGA', IUPACUnambiguousDNA())

In [10]:
# Converting reverse complement to RNA, then initializing an RNA molecule
seq1 = dna1.get_rna(option='reverse_complementary')
rna1 = RNA(use_permissive_alphabet=False).set_sequence(seq1)
rna1.get_sequence()


Out[10]:
Seq('UCUGUAUUGUCUUCUAUACGCUUUAAAAUCAACUGCAAGGUUUUUCAGUCGCUU...CAA', IUPACUnambiguousRNA())

In [11]:
# Converting coding sequence to protein, then initializing a protein molecule
seq1 = dna1.get_protein()
prot1 = Protein(use_permissive_alphabet=False).set_sequence(seq1)
prot1.get_sequence()


Out[11]:
Seq('LLSLPGVRRPRPFQVKRLKNLAVDFKAYRRQYR', IUPACProtein())

In [12]:
# Converting only a subset of the coding sequence to protein
seq1 = dna1.get_protein(start=66,end=99)
prot1 = Protein(use_permissive_alphabet=False).set_sequence(seq1)
prot1.get_sequence()


Out[12]:
Seq('VDFKAYRRQYR', IUPACProtein())